String Processing in R

Author

Martin Schweinberger

Introduction

This tutorial introduces string processing in R — the art of manipulating, searching, extracting, and transforming character data. String processing is a foundational skill for linguistic research: nearly every corpus study, text-mining project, or annotation pipeline begins with reading raw text and ends with cleaned, structured character data ready for analysis.

The tutorial is aimed at beginner and intermediate R users. It progresses from basic string operations in base R and the stringr package, through regular expressions and text-cleaning pipelines, to tokenisation with quanteda. Each section introduces functions with linguistic examples and includes worked exercises.

Prerequisite Tutorials

Before working through this tutorial, you should be familiar with basic R usage.

If you are new to R, work through Getting Started with R first.

Learning Objectives

By the end of this tutorial you will be able to:

  1. Apply core base R string functions (nchar, paste, substr, gsub, grep, tolower, toupper)
  2. Use the full suite of stringr functions for detecting, extracting, replacing, splitting, padding, and combining strings
  3. Use str_glue() and str_glue_data() for string interpolation in reports and data pipelines
  4. Work with factors as strings using forcats — relabel, reorder, collapse, and filter factor levels
  5. Format strings for table output using padding, truncation, and number formatting
  6. Handle Unicode, encoding issues, and non-ASCII characters (IPA, non-Latin scripts)
  7. Write regular expressions including character classes, quantifiers, anchors, alternation, named capture groups, and lookahead/lookbehind
  8. Build reproducible text-cleaning pipelines combining multiple string operations
  9. Tokenise text using quanteda and understand the difference between word, sentence, and character tokenisation
Citation

Schweinberger, Martin. 2026. String Processing in R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/string/string.html (Version 2026.02.24).


Interactive Notebook

An interactive, notebook-based version of this tutorial is available via the Binder link below. It allows you to upload your own texts, apply cleaning operations, and download the results without installing anything locally.

Click here to open the interactive string-processing notebook.


Setup

Installing Packages

Code
# Run once — comment out after installation
install.packages("tidyverse")   # stringr, dplyr, tidyr, purrr, ggplot2, forcats
install.packages("here")        # reproducible file paths
install.packages("flextable")   # formatted tables
install.packages("quanteda")    # tokenisation and corpus tools
install.packages("tm")          # text-mining utilities (stopwords, stemming)
install.packages("checkdown")   # interactive quiz questions
install.packages("remotes")
remotes::install_github("rlesur/klippy")

Loading Packages

Code
library(tidyverse)   # loads stringr, dplyr, purrr, ggplot2, forcats
library(here)
library(flextable)
library(quanteda)
library(tm)
library(checkdown)
klippy::klippy()

Loading Example Texts

Throughout this tutorial we work with four example texts loaded from the LADAL data repository.

Code
# Text 1: paragraph about grammar (single string)
exampletext <- base::readRDS("tutorials/string/data/tx1.rda")

# Text 2: same paragraph split into sentences (character vector)
splitexampletext <- base::readRDS("tutorials/string/data/tx2.rda")

# Text 3: paragraph about Ferdinand de Saussure (single string)
additionaltext <- base::readRDS("tutorials/string/data/tx3.rda")

# Text 4: three short sentences (character vector)
sentences <- base::readRDS("tutorials/string/data/tx4.rda")

# Inspect
cat("exampletext (first 120 chars):\n", substr(exampletext, 1, 120), "\n\n")
exampletext (first 120 chars):
 Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply t 
Code
cat("splitexampletext:\n"); print(splitexampletext); cat("\n")
splitexampletext:
[1] "Grammar is a system of rules which governs the production and use of utterances in a given language."                                                                                                                                                                                                   
[2] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
[3] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."                                                                                                                                                                         
Code
cat("sentences:\n"); print(sentences)
sentences:
[1] "This is a first sentence."     "This is a second sentence."   
[3] "And this is a third sentence."
Character Vectors in R

A character vector is R’s basic data structure for text. Each element is a separate string — exampletext is length 1 (one long string), while splitexampletext is length n (one element per sentence). Most stringr functions are vectorised: they accept vectors of any length and return a result of the same length, making it easy to process many strings at once.
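As a quick illustration of this vectorised behaviour (a minimal sketch using a small stand-in vector rather than the tutorial texts):

```r
library(stringr)

# one input element in, one result out: no loops required
greetings <- c("hello", "good morning", "hi there")

str_length(greetings)           # one length per element: 5 12 8
str_to_upper(greetings)         # case conversion applied element-wise
str_detect(greetings, "good")   # one TRUE/FALSE per element: FALSE TRUE FALSE
```

Every stringr call above returns a vector exactly as long as its input, which is what makes these functions safe to use inside mutate() and other column-wise operations.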


Base R String Functions

Section Overview

What you will learn: The most important string functions available in base R — no packages required. These underpin everything else and appear throughout code you will encounter in the wild.

Case Conversion

Code
tolower(exampletext) |> substr(1, 80)
[1] "grammar is a system of rules which governs the production and use of utterances "
Code
toupper(exampletext) |> substr(1, 80)
[1] "GRAMMAR IS A SYSTEM OF RULES WHICH GOVERNS THE PRODUCTION AND USE OF UTTERANCES "

String Length

Code
# Number of characters per element
nchar(splitexampletext)
[1] 100 295 126
Code
# NA-safe version
nchar(c("hello", NA, "world"), keepNA = TRUE)
[1]  5 NA  5

Substrings

Code
# Extract characters 1–60
substr(exampletext, 1, 60)
[1] "Grammar is a system of rules which governs the production an"
Code
# Replacement: overwrite a substring in-place
tmp <- exampletext
substr(tmp, 1, 7) <- "[REDACTED]"  # only positions 1-7 are overwritten; a longer replacement is truncated
substr(tmp, 1, 25)
[1] "[REDACT is a system of ru"

Combining Strings

Code
paste("Participant", 1:4, sep = "_")       # with separator
[1] "Participant_1" "Participant_2" "Participant_3" "Participant_4"
Code
paste0("Item", LETTERS[1:4])               # no separator
[1] "ItemA" "ItemB" "ItemC" "ItemD"
Code
paste(sentences, collapse = " | ")         # collapse vector to one string
[1] "This is a first sentence. | This is a second sentence. | And this is a third sentence."

Pattern Matching and Replacement

Code
# grep: indices of matching elements
grep("grammar", splitexampletext)
[1] 3
Code
# grepl: logical vector
grepl("grammar", splitexampletext)
[1] FALSE FALSE  TRUE
Code
# sub: replace FIRST match per string
sub("grammar", "GRAMMAR", exampletext) |> substr(1, 80)
[1] "Grammar is a system of rules which governs the production and use of utterances "
Code
# gsub: replace ALL matches per string
gsub("\\band\\b", "&", exampletext) |> substr(1, 80)
[1] "Grammar is a system of rules which governs the production & use of utterances in"
Code
# ignore.case
grep("grammar", splitexampletext, ignore.case = TRUE)
[1] 1 3
gsub() vs. str_replace_all()

Both replace all occurrences of a pattern. The key practical difference is argument order: gsub(pattern, replacement, string) puts the string last (inconvenient for pipes), while str_replace_all(string, pattern, replacement) puts the string first (pipe-friendly). For new code, prefer stringr. For reading legacy code, recognise gsub.
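A minimal side-by-side sketch of the two argument orders (txt is a throwaway example string):

```r
library(stringr)

txt <- "rules and sounds and meanings"

gsub("and", "&", txt)              # base R: pattern, replacement, string
str_replace_all(txt, "and", "&")   # stringr: string, pattern, replacement

# the string-first order slots naturally into a pipe
txt |> str_replace_all("and", "&")
```

All three calls return the same result; the difference only becomes visible once the call sits inside a longer pipeline.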

Splitting Strings

Code
# strsplit returns a LIST — one element per input string
words_list <- strsplit(exampletext, "\\s+")
head(words_list[[1]], 10)
 [1] "Grammar"    "is"         "a"          "system"     "of"        
 [6] "rules"      "which"      "governs"    "the"        "production"
Code
# Flatten to a plain vector
words_vec <- strsplit(exampletext, "\\s+")[[1]]
length(words_vec)
[1] 81

You have a character vector texts with 50 sentences. You want the indices of sentences that contain the word “the” (case-insensitive). Which call is correct?

  1. grep("the", texts, ignore.case = TRUE) — returns matching indices
  2. gsub("the", "", texts) — removes “the” from each sentence
  3. grepl("the", texts, ignore.case = TRUE) — returns a logical vector, not indices
  4. sub("the", "THE", texts) — replaces the first match only
Answer

a) grep("the", texts, ignore.case = TRUE)

grep() returns the positions (indices) of matching elements. grepl() (option c) is also useful but returns TRUE/FALSE — use it when filtering with texts[grepl(...)]. Options b and d perform replacements.


Core stringr Functions

Section Overview

What you will learn: The complete set of stringr functions for detecting, extracting, replacing, splitting, padding, ordering, and combining strings — all following the consistent str_verb(string, pattern) convention that makes them ideal for pipelines.

Detecting Patterns

Code
str_detect(splitexampletext, "grammar")           # logical vector
[1] FALSE FALSE  TRUE
Code
str_starts(splitexampletext, "[A-Z]")             # starts with capital
[1] TRUE TRUE TRUE
Code
str_ends(splitexampletext,   "\\.")               # ends with full stop
[1] TRUE TRUE TRUE
Code
str_which(splitexampletext,  "grammar")           # indices of matches
[1] 3
Code
str_count(exampletext, "\\band\\b")               # count occurrences
[1] 6

Extracting Patterns

Code
# First match per element
str_extract(splitexampletext, "\\b[A-Z][a-z]+\\b")
[1] "Grammar" "These"   "Many"   
Code
# All matches per element (returns a list)
str_extract_all(exampletext, "\\b[A-Z][a-z]+\\b")[[1]]
[1] "Grammar" "These"   "Many"    "Noam"    "Chomsky"
Code
# First match plus capture groups (matrix: col 1 = full match, col 2+ = groups)
str_match(exampletext, "\\bthe (\\w+)\\b")
     [,1]             [,2]        
[1,] "the production" "production"
Code
# All matches plus groups
str_match_all(exampletext, "\\bthe (\\w+)\\b")[[1]] |> head(5)
     [,1]               [,2]          
[1,] "the production"   "production"  
[2,] "the organisation" "organisation"
[3,] "the formation"    "formation"   
[4,] "the formation"    "formation"   
[5,] "the principles"   "principles"  

Replacing and Removing Patterns

Code
str_replace(exampletext, "grammar", "GRAMMAR") |> substr(1, 80)
[1] "Grammar is a system of rules which governs the production and use of utterances "
Code
str_replace_all(exampletext, "\\band\\b", "&") |> substr(1, 80)
[1] "Grammar is a system of rules which governs the production & use of utterances in"
Code
str_remove(exampletext, "\\bgrammar\\b") |> substr(1, 80)
[1] "Grammar is a system of rules which governs the production and use of utterances "
Code
str_remove_all(exampletext, "[,;.]") |> substr(1, 80)
[1] "Grammar is a system of rules which governs the production and use of utterances "

Splitting Strings

Code
# str_split: returns a list
str_split(exampletext, "\\s+")[[1]] |> head(8)
[1] "Grammar" "is"      "a"       "system"  "of"      "rules"   "which"  
[8] "governs"
Code
# str_split_fixed: returns a matrix with exactly n columns
str_split_fixed(sentences, "\\s+", n = 3)
     [,1]   [,2]   [,3]                  
[1,] "This" "is"   "a first sentence."   
[2,] "This" "is"   "a second sentence."  
[3,] "And"  "this" "is a third sentence."
Code
# Split on sentence boundaries (lookbehind for .!?)
str_split(exampletext, "(?<=[.!?])\\s+")[[1]]
[1] "Grammar is a system of rules which governs the production and use of utterances in a given language."                                                                                                                                                                                                   
[2] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
[3] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."                                                                                                                                                                         

Subsetting Strings

Code
str_sub(exampletext, 1, 60)                           # by character position
[1] "Grammar is a system of rules which governs the production an"
Code
str_subset(splitexampletext, "grammar|syntax")        # keep matching elements
[1] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
[2] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."                                                                                                                                                                         
Code
str_trunc(splitexampletext, width = 45)               # truncate with "..."
[1] "Grammar is a system of rules which governs..."
[2] "These rules apply to sound as well as mean..."
[3] "Many modern theories that deal with the pr..."

Padding, Whitespace, and Truncation

String formatting for table output, report generation, and aligned displays is one of the most practically useful areas of stringr.

Code
# str_trim: remove leading and trailing whitespace
messy <- "  This   has  extra   spaces.  "
str_trim(messy)
[1] "This   has  extra   spaces."
Code
# str_squish: remove leading/trailing AND internal runs of whitespace
str_squish(messy)
[1] "This has extra spaces."
Code
# str_pad: add characters to reach a target width
# Useful for aligning columns in plain-text reports
words_ex <- c("the", "corpus", "linguistics", "syntax")
str_pad(words_ex, width = 15, side = "right")   # left-aligned (pad right)
[1] "the            " "corpus         " "linguistics    " "syntax         "
Code
str_pad(words_ex, width = 15, side = "left")    # right-aligned (pad left)
[1] "            the" "         corpus" "    linguistics" "         syntax"
Code
str_pad(words_ex, width = 15, side = "both")    # centred
[1] "      the      " "    corpus     " "  linguistics  " "    syntax     "
Code
# Custom pad character (e.g. for dot-leaders in a table of contents)
str_pad(words_ex, width = 20, side = "right", pad = ".")
[1] "the................." "corpus.............." "linguistics........."
[4] "syntax.............."
Code
# str_trunc with different sides
str_trunc("A very long sentence about linguistics.", width = 25, side = "right")
[1] "A very long sentence a..."
Code
str_trunc("A very long sentence about linguistics.", width = 25, side = "left")
[1] "...nce about linguistics."
Code
str_trunc("A very long sentence about linguistics.", width = 25, side = "center")
[1] "A very long...inguistics."
Code
# Practical example: create an aligned plain-text frequency table
word_freqs <- data.frame(
  word = c("grammar", "syntax", "morphology", "phonology", "semantics"),
  freq = c(42, 38, 27, 19, 14),
  stringsAsFactors = FALSE
)

# Format for aligned display
word_freqs |>
  dplyr::mutate(
    word_padded = str_pad(word, width = 12, side = "right"),
    freq_padded = str_pad(as.character(freq), width = 6, side = "left"),
    pct         = round(100 * freq / sum(freq), 1),
    pct_padded  = str_pad(paste0(pct, "%"), width = 7, side = "left")
  ) |>
  dplyr::mutate(row = paste(word_padded, freq_padded, pct_padded)) |>
  dplyr::pull(row) |>
  (\(x) c("Word          Count     Pct",
           paste(rep("-", 27), collapse = ""),
           x))() |>
  cat(sep = "\n")
Word          Count     Pct
---------------------------
grammar          42     30%
syntax           38   27.1%
morphology       27   19.3%
phonology        19   13.6%
semantics        14     10%
Number Formatting with formatC() and sprintf()

For numeric string formatting, base R’s formatC() and sprintf() complement str_pad():

# Fixed decimal places
formatC(3.14159, digits = 3, format = "f")   # "3.142"

# Thousands separator
formatC(12345678, format = "d", big.mark = ",")  # "12,345,678"

# sprintf: C-style formatting
sprintf("Mean RT = %.1f ms (SD = %.1f)", 612.4, 87.3)

# Percentage formatting
sprintf("%.1f%%", 0.347 * 100)   # "34.7%"

Combining and Interpolating Strings

str_c() and str_flatten()

Code
# str_c: concatenate element-wise (NA-safe unlike paste0)
str_c("P", str_pad(1:5, 2, pad = "0"), sep = "")   # P01, P02, ...
[1] "P01" "P02" "P03" "P04" "P05"
Code
# str_c with NA: propagates NA (unlike paste0 which gives "NA")
str_c("prefix_", c("a", NA, "c"))
[1] "prefix_a" NA         "prefix_c"
Code
paste0("prefix_", c("a", NA, "c"))    # compare: NA becomes the string "prefix_NA"
[1] "prefix_a"  "prefix_NA" "prefix_c" 
Code
# str_flatten: collapse a vector to a single string
str_flatten(sentences, collapse = " ")
[1] "This is a first sentence. This is a second sentence. And this is a third sentence."
Code
str_flatten(c("cat", "dog", "bird"), collapse = ", ", last = " and ")
[1] "cat, dog and bird"

str_glue(): String Interpolation

str_glue() embeds R expressions directly in strings using {...} placeholders. This is far more readable than nested paste() calls and is the recommended approach for generating report text, axis labels, and data-driven narrative.

Code
# Basic interpolation
speaker  <- "P03"
n_tokens <- 1247
lang     <- "English"

str_glue("Speaker {speaker} (L1: {lang}) produced {n_tokens} tokens.")
Speaker P03 (L1: English) produced 1247 tokens.
Code
# Arithmetic inside {}
str_glue("Mean rate: {round(n_tokens / 60, 1)} tokens per minute.")
Mean rate: 20.8 tokens per minute.
Code
# Conditional text
proficiency <- "Advanced"
str_glue("Speaker {speaker} is {tolower(proficiency)}.",
         " ",
         "Their token count was {ifelse(n_tokens > 1000, 'above', 'below')} 1,000.")
Speaker P03 is advanced. Their token count was above 1,000.
Code
# Multi-line glue (newlines are preserved unless you collapse)
str_glue(
  "--- Speaker Report ---\n",
  "ID:          {speaker}\n",
  "L1:          {lang}\n",
  "Tokens:      {n_tokens}\n",
  "Proficiency: {proficiency}"
)
--- Speaker Report ---
ID:          P03
L1:          English
Tokens:      1247
Proficiency: Advanced

str_glue_data(): Interpolation Over a Data Frame

str_glue_data() applies the template to every row of a data frame. This is ideal for generating per-participant summaries, axis labels, or APA-style results sentences.

Code
# Sample participant data
participants <- data.frame(
  id          = paste0("P", str_pad(1:6, 2, pad = "0")),
  l1          = c("English", "German", "French", "Japanese", "Spanish", "Mandarin"),
  tokens      = c(1247, 983, 1105, 876, 1031, 942),
  accuracy    = c(0.92, 0.87, 0.89, 0.84, 0.91, 0.86),
  proficiency = c("Advanced", "Intermediate", "Advanced",
                  "Intermediate", "Advanced", "Intermediate"),
  stringsAsFactors = FALSE
)

# Generate one summary sentence per participant
participants |>
  str_glue_data(
    "Speaker {id} (L1: {l1}, {proficiency}) produced {tokens} tokens ",
    "with {round(accuracy * 100, 1)}% accuracy."
  )
Speaker P01 (L1: English, Advanced) produced 1247 tokens with 92% accuracy.
Speaker P02 (L1: German, Intermediate) produced 983 tokens with 87% accuracy.
Speaker P03 (L1: French, Advanced) produced 1105 tokens with 89% accuracy.
Speaker P04 (L1: Japanese, Intermediate) produced 876 tokens with 84% accuracy.
Speaker P05 (L1: Spanish, Advanced) produced 1031 tokens with 91% accuracy.
Speaker P06 (L1: Mandarin, Intermediate) produced 942 tokens with 86% accuracy.
Code
# Generate APA-style result sentences for each comparison
results_df <- data.frame(
  comparison = c("Primed vs. Unprimed", "High- vs. Low-Frequency"),
  beta       = c(-0.082, -0.051),
  se         = c(0.018, 0.013),
  t_val      = c(-4.56, -3.92),
  p_val      = c(0.0001, 0.0009),
  stringsAsFactors = FALSE
)

results_df |>
  str_glue_data(
    "{comparison}: β = {round(beta, 3)}, SE = {round(se, 3)}, ",
    "t = {round(t_val, 2)}, p {ifelse(p_val < .001, '< .001', paste0('= ', round(p_val, 3)))}."
  )
Primed vs. Unprimed: β = -0.082, SE = 0.018, t = -4.56, p < .001.
High- vs. Low-Frequency: β = -0.051, SE = 0.013, t = -3.92, p < .001.
When to Use str_glue() vs. paste()

Use str_glue() whenever you have more than one or two variables to embed in a string. The {variable} syntax reads naturally as prose and supports arbitrary R expressions, while paste() becomes hard to read as the number of arguments grows. For vectorised row-by-row generation from a data frame, always prefer str_glue_data() over apply() + paste().
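To make the contrast concrete, here is the same sentence built both ways (speaker and tokens are made-up values):

```r
library(stringr)

speaker <- "P07"
tokens  <- 532

# paste0: workable with two variables, but the pieces fragment quickly
paste0("Speaker ", speaker, " produced ", tokens, " tokens.")

# str_glue: the template reads as prose
str_glue("Speaker {speaker} produced {tokens} tokens.")
```

Note that str_glue() returns a glue object; wrap it in as.character() when a plain character vector is required.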

Sorting and Ordering

Code
str_sort(sentences)                                    # default locale
[1] "And this is a third sentence." "This is a first sentence."    
[3] "This is a second sentence."   
Code
str_sort(sentences, decreasing = TRUE)
[1] "This is a second sentence."    "This is a first sentence."    
[3] "And this is a third sentence."
Code
# Locale matters for non-English alphabets
nordic <- c("ångström", "öl", "äpple", "banan", "citron")
str_sort(nordic)                     # incorrect for Swedish
[1] "ångström" "äpple"    "banan"    "citron"   "öl"      
Code
str_sort(nordic, locale = "sv")      # correct Swedish alphabetical order
[1] "banan"    "citron"   "ångström" "äpple"    "öl"      
Code
str_order(sentences)                 # returns ordering indices
[1] 3 1 2
Your turn!

Q2 You have an interview transcript and want to replace every occurrence of a participant’s real name (“Sarah”) with the pseudonym “P01”. Which stringr function is correct?





Q3 Which stringr functions manipulate whitespace? (Select all that apply.)







Working with Factors as Strings

Section Overview

What you will learn: How factors differ from character vectors; why factor level ordering matters for plots and models; and how to use forcats to relabel, reorder, collapse, and filter factor levels — tasks that arise constantly when working with categorical linguistic data (POS tags, speaker groups, genre labels, annotation codes).

Factors vs. Character Vectors

A factor is a categorical variable stored as integers with a character levels attribute. Factors are essential for:

  • Controlling the order of categories in plots (without factors, ggplot2 sorts alphabetically)
  • Setting reference levels in regression models
  • Summarising data by a fixed set of categories (including empty ones)
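The integer-plus-levels representation, and the reference-level mechanism mentioned above, can be inspected directly. A minimal sketch with an invented two-level grouping factor:

```r
# a factor is stored as integer codes plus a character levels attribute
grp <- factor(c("L2", "L1", "L2", "L1", "L1"))

as.integer(grp)   # the underlying codes: 2 1 2 1 1
levels(grp)       # "L1" "L2": the first level is the default model reference

# relevel() changes the reference level used by regression contrasts
grp2 <- relevel(grp, ref = "L2")
levels(grp2)      # "L2" "L1"
```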
Code
# Character vector vs. factor
pos_chars  <- c("NN", "VBZ", "DT", "NN", "JJ", "NN", "VBZ", "RB")
pos_factor <- factor(pos_chars,
                     levels = c("DT", "JJ", "NN", "RB", "VBZ"))

# Key differences
class(pos_chars)       # "character"
[1] "character"
Code
class(pos_factor)      # "factor"
[1] "factor"
Code
levels(pos_factor)     # the defined level set, in order
[1] "DT"  "JJ"  "NN"  "RB"  "VBZ"
Code
nlevels(pos_factor)    # number of levels
[1] 5
Code
# A factor remembers ALL levels even if some are absent in the data
absent_level <- factor(c("A", "B"), levels = c("A", "B", "C"))
table(absent_level)    # C appears with count 0
absent_level
A B C 
1 1 0 

The forcats Package

forcats (loaded as part of the tidyverse) provides a coherent set of functions for working with factors. All function names begin with fct_.

Reordering Levels

Code
# Sample annotation data
anno_df <- data.frame(
  token = c("the", "corpus", "contains", "very", "interesting", "data",
            "the", "speaker", "spoke", "quite", "quickly", "today"),
  upos  = c("DT", "NN", "VBZ", "RB", "JJ", "NN",
             "DT", "NN", "VBD", "RB", "RB", "NN"),
  stringsAsFactors = FALSE
)

# Without forcats: alphabetical order in plot (rarely what we want)
ggplot(anno_df, aes(x = upos)) +
  geom_bar(fill = "steelblue") +
  theme_bw() +
  labs(title = "POS distribution (alphabetical — default)")

Code
# fct_infreq: order by descending frequency
anno_df |>
  dplyr::mutate(upos = forcats::fct_infreq(upos)) |>
  ggplot(aes(x = upos)) +
  geom_bar(fill = "steelblue") +
  theme_bw() +
  labs(title = "POS distribution (ordered by frequency)")

Code
# fct_rev: reverse current level order
anno_df |>
  dplyr::mutate(upos = forcats::fct_rev(forcats::fct_infreq(upos))) |>
  ggplot(aes(x = upos)) +
  geom_bar(fill = "steelblue") +
  coord_flip() +
  theme_bw() +
  labs(title = "POS distribution (frequency order, horizontal)")

Code
# fct_reorder: order a factor by a summary statistic of another variable
set.seed(123)  # make the simulated RTs reproducible
rt_df <- data.frame(
  condition   = rep(c("Primed", "Unprimed", "Filler"), each = 40),
  rt          = c(rnorm(40, 580, 60), rnorm(40, 650, 70), rnorm(40, 700, 80))
)

# Reorder condition levels by their median RT before plotting
rt_df |>
  dplyr::mutate(condition = forcats::fct_reorder(condition, rt, .fun = median)) |>
  ggplot(aes(x = condition, y = rt, fill = condition)) +
  geom_boxplot(show.legend = FALSE) +
  theme_bw() +
  labs(title = "RT by condition (ordered by median RT)",
       x = "Condition", y = "Reaction time (ms)")

Relabelling Levels

Code
# fct_recode: rename individual levels
pos_factor_labelled <- forcats::fct_recode(
  factor(anno_df$upos),
  "Determiner"  = "DT",
  "Adjective"   = "JJ",
  "Noun"        = "NN",
  "Adverb"      = "RB",
  "Verb (past)" = "VBD",
  "Verb (pres)" = "VBZ"
)
levels(pos_factor_labelled)
[1] "Determiner"  "Adjective"   "Noun"        "Adverb"      "Verb (past)"
[6] "Verb (pres)"
Code
table(pos_factor_labelled)
pos_factor_labelled
 Determiner   Adjective        Noun      Adverb Verb (past) Verb (pres) 
          2           1           4           3           1           1 
Code
# fct_relabel: apply a function to ALL level names at once
pos_lower <- forcats::fct_relabel(factor(anno_df$upos), tolower)
levels(pos_lower)
[1] "dt"  "jj"  "nn"  "rb"  "vbd" "vbz"

Collapsing and Lumping Levels

When a factor has many levels, it is often useful to collapse rare or related levels into a single catch-all category.

Code
# Simulate a larger POS-tagged corpus
set.seed(42)
all_pos <- sample(
  c("NN", "VBZ", "DT", "JJ", "RB", "IN", "PRP", "VBD", "NNS", "VBP",
    "CC", "MD", "WP", "EX", "UH"),
  size    = 200,
  replace = TRUE,
  prob    = c(0.20, 0.12, 0.11, 0.09, 0.08, 0.07, 0.06, 0.06,
              0.05, 0.04, 0.04, 0.03, 0.02, 0.02, 0.01)
)

pos_factor_full <- factor(all_pos)
nlevels(pos_factor_full)  # only tags present in the data become levels (here 14 of the 15 sampled)
[1] 14
Code
# fct_lump_n: keep the n most frequent levels, collapse the rest to "Other"
pos_lumped_5 <- forcats::fct_lump_n(pos_factor_full, n = 5)
table(pos_lumped_5)
pos_lumped_5
   DT    JJ    NN   VBD   VBZ Other 
   19    16    41    22    18    84 
Code
# fct_lump_prop: keep levels accounting for > prop of observations
pos_lumped_prop <- forcats::fct_lump_prop(pos_factor_full, prop = 0.05)
table(pos_lumped_prop)
pos_lumped_prop
   DT    IN    JJ    NN   NNS   PRP    RB   VBD   VBZ Other 
   19    15    16    41    12    13    14    22    18    30 
Code
# fct_other: manually specify which levels to keep (all others → "Other")
pos_content <- forcats::fct_other(
  pos_factor_full,
  keep = c("NN", "NNS", "VBZ", "VBD", "VBP", "JJ"),
  other_level = "Function"
)
table(pos_content)
pos_content
      JJ       NN      NNS      VBD      VBP      VBZ Function 
      16       41       12       22        9       18       82 

Adding and Dropping Levels

Code
# fct_drop: remove levels that have no observations
all_genres <- factor(c("academic", "fiction", "news"),
                     levels = c("academic", "fiction", "news", "spoken", "web"))
nlevels(all_genres)           # 5 levels
[1] 5
Code
nlevels(forcats::fct_drop(all_genres))  # 3 levels
[1] 3
Code
# fct_expand: add new levels (useful before rbind-ing data frames)
expanded <- forcats::fct_expand(all_genres, "social_media", "blog")
levels(expanded)
[1] "academic"     "fiction"      "news"         "spoken"       "web"         
[6] "social_media" "blog"        
Code
# fct_na_value_to_level: treat NA as an explicit factor level
with_na  <- factor(c("academic", NA, "fiction", NA, "news"))
with_na_level <- forcats::fct_na_value_to_level(with_na, level = "Unknown")
table(with_na_level, useNA = "always")
with_na_level
academic  fiction     news  Unknown     <NA> 
       1        1        1        2        0 

A researcher has a factor genre with levels in alphabetical order: "academic", "fiction", "news", "spoken". She wants to reorder the bars in a ggplot2 bar chart so that the most frequent genre appears first. Which forcats function should she use?

  1. fct_reorder(genre, genre) — reorder by alphabetical value
  2. fct_infreq(genre) — reorder levels by descending frequency of observations
  3. fct_rev(genre) — reverse the current alphabetical order
  4. fct_recode(genre) — rename the level labels
Answer

b) fct_infreq(genre) — reorder levels by descending frequency of observations

fct_infreq() reorders factor levels so that the most frequently occurring level comes first, which is exactly what places it as the first bar in a bar chart. fct_reorder() (option a) reorders by a summary statistic of another variable (e.g. median RT), not by the factor’s own frequency. fct_rev() only reverses the existing order without considering frequency. fct_recode() changes level names, not order.


Unicode, Encoding, and Non-ASCII Characters

Section Overview

What you will learn: What text encoding is and why it matters for linguistic data; how to detect and fix encoding problems; how to work with IPA symbols, non-Latin scripts, and Unicode special characters in R; and locale-aware case conversion for non-English languages.

What Is Text Encoding?

A character encoding maps each character to a sequence of bytes. The most important encodings for linguistic research are:

Common text encodings:

  • UTF-8: all Unicode characters (~150,000); modern files and web data; the recommended default
  • Latin-1 / ISO-8859-1: Western European languages; older files, Windows legacy
  • Windows-1252 (CP1252): Western European plus smart quotes; files created on Windows
  • UTF-16: all of Unicode (2 or 4 bytes per character); some Windows apps, older XML
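The byte-level differences between these encodings can be inspected with charToRaw(). A minimal sketch using the single character "é" (U+00E9):

```r
# "é" is one code point, but its byte representation depends on the encoding
e_utf8   <- "\u00e9"
e_latin1 <- iconv(e_utf8, from = "UTF-8", to = "latin1")

charToRaw(e_utf8)     # two bytes in UTF-8: c3 a9
charToRaw(e_latin1)   # one byte in Latin-1: e9
nchar(e_utf8)         # still a single character: 1
```

Mis-declaring the encoding therefore means reading the right bytes as the wrong characters, which is exactly how "é" turns into mojibake like "Ã©".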
Always Use UTF-8

Save all R scripts and data files in UTF-8. In RStudio: File → Save with Encoding → UTF-8. Set your default in Tools → Global Options → Code → Saving → Default text encoding: UTF-8. Nearly all encoding headaches arise from mixing UTF-8 and Latin-1 files.
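When reading or writing files from code, the encoding can also be fixed explicitly rather than left to the system locale. A minimal sketch using a temporary file (the path is created here purely for illustration):

```r
path <- tempfile(fileext = ".txt")

# write the file as UTF-8 regardless of the platform default
con <- file(path, open = "w", encoding = "UTF-8")
writeLines("café résumé naïve", con)
close(con)

# declare the encoding when reading the file back in
readLines(path, encoding = "UTF-8")
```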

Detecting and Converting Encodings

Code
# str_conv: convert encoding
latin1_text <- iconv("café résumé naïve", to = "latin1")
utf8_text   <- stringr::str_conv(latin1_text, encoding = "latin1")
utf8_text
[1] "café résumé naïve"
Code
# iconv: lower-level conversion with error handling
# sub = "byte": replace invalid bytes with their hex code (never fails)
# sub = NA:     return NA for strings with invalid bytes (for detection)
mixed <- c("valid UTF-8", iconv("caf\xe9", from = "latin1", to = "UTF-8"))
iconv(mixed, from = "UTF-8", to = "UTF-8", sub = NA)
[1] "valid UTF-8" "café"       
Code
# Detect encoding of an unknown file (requires stringi)
# stringi::stri_enc_detect(readBin("unknown_file.txt", "raw", 10000))

IPA and Phonetic Symbols

IPA symbols are ordinary Unicode code points and are fully supported in R via UTF-8:

Code
# IPA transcriptions
ipa <- c(
  "linguistics"  = "/lɪŋˈɡwɪstɪks/",
  "phonology"    = "/fəˈnɒlədʒi/",
  "morphology"   = "/mɔːˈfɒlədʒi/",
  "syntax"       = "/ˈsɪntæks/",
  "semantics"    = "/sɪˈmæntɪks/"
)

nchar(ipa)                              # character count per transcription
linguistics   phonology  morphology      syntax   semantics 
         14          12          13          10          12 
Code
str_detect(ipa, "ɪ")                   # detect the IPA vowel ɪ (LATIN LETTER SMALL CAPITAL I)
[1]  TRUE FALSE FALSE  TRUE  TRUE
Code
str_extract_all(ipa, "[ˈˌ][^ˈˌ/]+")   # extract stressed syllables
[[1]]
[1] "ˈɡwɪstɪks"

[[2]]
[1] "ˈnɒlədʒi"

[[3]]
[1] "ˈfɒlədʒi"

[[4]]
[1] "ˈsɪntæks"

[[5]]
[1] "ˈmæntɪks"
Code
# Remove stress marks and syllable boundaries
str_remove_all(ipa, "[ˈˌ.\\-]")
[1] "/lɪŋɡwɪstɪks/" "/fənɒlədʒi/"   "/mɔːfɒlədʒi/"  "/sɪntæks/"    
[5] "/sɪmæntɪks/"  
Code
# Extract only vowels (broad IPA vowel symbols)
vowels_ipa <- "[aeiouæɑɒɔəɛɜɪʊʌ]"
str_extract_all(ipa, vowels_ipa) |>
  purrr::map(~ paste(.x, collapse = "")) |>
  unlist()
[1] "ɪɪɪ"  "əɒəi" "ɔɒəi" "ɪæ"   "ɪæɪ" 

Non-Latin Scripts

Code
# R handles any Unicode script natively
arabic   <- "اللغويات"          # Arabic: "linguistics"
chinese  <- "语言学"             # Mandarin: "linguistics"
japanese <- "言語学"             # Japanese: "linguistics"
greek    <- "γλωσσολογία"        # Greek: "glōssología"
russian  <- "лингвистика"        # Russian: "lingvistika"
hindi    <- "भाषाविज्ञान"       # Hindi: "bhāṣāvijñāna"

scripts  <- c(arabic, chinese, japanese, greek, russian, hindi)
nchar(scripts)                   # character count (code points)
[1]  8  3  3 11 11 11
Code
# str_length is stringr's equivalent of nchar (counts code points)
str_length(scripts)
[1]  8  3  3 11 11 11
Code
# Detect Cyrillic characters
str_detect(scripts, "\\p{Script=Cyrillic}")
[1] FALSE FALSE FALSE FALSE  TRUE FALSE
Code
# Detect CJK characters (Chinese/Japanese/Korean)
str_detect(scripts, "\\p{Script=Han}")
[1] FALSE  TRUE  TRUE FALSE FALSE FALSE
Unicode Script Properties in Regex

stringr uses the ICU regex engine (via the stringi package), which supports Unicode property escapes of the form \p{Property=Value}. Useful ones for linguists:

Unicode property escapes
Pattern Matches
\p{L} Any Unicode letter
\p{Lu} Uppercase letter
\p{Ll} Lowercase letter
\p{N} Any numeric character
\p{Script=Latin} Latin-script characters
\p{Script=Arabic} Arabic-script characters
\p{Script=Han} CJK characters
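A brief sketch of these escapes in action (assuming stringr is loaded, as elsewhere in this tutorial):

```r
library(stringr)

mixed <- "Sample 3: Überprüfung, тест, 测试"

str_extract_all(mixed, "\\p{L}+")[[1]]       # letter runs, regardless of script
str_count(mixed, "\\p{N}")                   # numeric characters: 1
str_extract(mixed, "\\p{Script=Cyrillic}+")  # only the Cyrillic word: "тест"
```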

Locale-Aware Case Conversion

Code
# Turkish has dotted/dotless i — standard tolower/toupper fails
str_to_upper("istanbul", locale = "tr")   # İSTANBUL (correct for Turkish)
[1] "İSTANBUL"
Code
str_to_upper("istanbul", locale = "en")   # ISTANBUL (English behaviour)
[1] "ISTANBUL"
Code
str_to_lower("İSTANBUL", locale = "tr")   # istanbul
[1] "istanbul"
Code
str_to_lower("İSTANBUL", locale = "en")   # i̇stanbul (wrong for Turkish)
[1] "i̇stanbul"
Code
# German sharp s
str_to_upper("straße", locale = "de")     # STRASSE (ß → SS in uppercase)
[1] "STRASSE"
Code
# str_to_title: capitalise first letter of each word
str_to_title("the quick brown fox", locale = "en")
[1] "The Quick Brown Fox"

You are processing a corpus of files downloaded from an older German website. After reading the files with readLines(), some strings contain the raw bytes \xfc (ü), \xe4 (ä), and \xf6 (ö), which display as garbled characters. What is the most likely cause, and what is the correct fix?

  1. The files are corrupted — re-download them
  2. The files are encoded in Latin-1 (or Windows-1252), not UTF-8. Use readLines(f, encoding = "latin1") or iconv(text, from = "latin1", to = "UTF-8")
  3. R does not support German characters — use Python instead
  4. Use str_squish() to clean the garbled bytes
Answer

b) The files are encoded in Latin-1 (or Windows-1252), not UTF-8

The byte values \xfc, \xe4, and \xf6 are the Latin-1 encodings of ü, ä, and ö — common German characters. When R reads a file assuming UTF-8 but the file is Latin-1, these multi-byte characters appear garbled. The fix is to read with the correct encoding: readLines(f, encoding = "latin1"), or convert afterwards with iconv(text, from = "latin1", to = "UTF-8"). Option (d) is wrong — str_squish() handles whitespace only and has no effect on byte values.


Regular Expressions

Section Overview

What you will learn: How to write regex patterns using character classes, quantifiers, anchors, alternation, groups, named capture groups, and lookahead/lookbehind — with linguistic examples throughout. The focus is on patterns that arise in real linguistic data processing.

Special Characters and Escaping

Most characters match themselves literally. The following have special meaning and must be escaped with \\ in R strings:

. * + ? ^ $ ( ) [ ] { } | \
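A common stumbling block is that R's string parser consumes one layer of backslashes before the regex engine ever sees the pattern. writeLines() prints a string exactly as the engine receives it:

```r
# "\\." in R source is a 2-character string: a backslash and a dot
writeLines("end\\.")        # prints end\.  — the regex for a literal full stop
nchar("end\\.")             # 5 characters: e, n, d, \, .
```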

Code
# Match a literal full stop (. means "any character" in regex)
str_detect(c("end.", "end!"), "end\\.")   # only "end." matches
[1]  TRUE FALSE
Code
# Match a literal parenthesis
str_extract("Syntax (Chomsky 1957)", "\\([^)]+\\)")
[1] "(Chomsky 1957)"

Character Classes

Code
str_extract_all("linguistics", "[aeiou]")[[1]]          # vowels only
[1] "i" "u" "i" "i"
Code
str_extract_all("Word1 word2", "[A-Za-z]+")[[1]]        # letter sequences
[1] "Word" "word"
Code
str_extract_all("Score: 4/5", "[^A-Za-z: /]")[[1]]     # negated class
[1] "4" "5"
Code
# Shorthand classes (Unicode-aware in stringr's ICU engine)
# \\d = digits (like [0-9], plus other scripts' digits)   \\D = non-digit
# \\w = word characters (letters, digits, _)              \\W = non-word
# \\s = whitespace                                        \\S = non-whitespace
# \\b = word boundary (zero-width)

str_extract_all("Call 0412 345 678", "\\d+")[[1]]
[1] "0412" "345"  "678" 
Code
str_extract_all("one two three", "\\b\\w+\\b")[[1]]
[1] "one"   "two"   "three"

Quantifiers

Regex quantifiers
Quantifier Meaning Example
? 0 or 1 colou?r → colour, color
* 0 or more \\d* → zero or more digits
+ 1 or more \\d+ → one or more digits
{n} Exactly n \\w{4} → four consecutive word characters
{n,m} Between n and m \\d{2,4} → 2–4 digits
*? +? Lazy (minimal) Match as little as possible
Code
verbs <- c("walk", "walks", "walking", "walked", "runner")

str_subset(verbs, "\\w+ing$")           # -ing forms
[1] "walking"
Code
str_subset(verbs, "\\w+ed$")            # -ed forms
[1] "walked"
Code
str_subset(verbs, "^\\w{4}$")           # exactly 4 characters
[1] "walk"
Code
str_detect(c("colour", "color"), "colou?r")  # optional u
[1] TRUE TRUE
Code
# Greedy vs. lazy
quoted <- 'She said "very" and he said "quite good"'
str_extract(quoted, '".*"')             # greedy: first to last "
[1] "\"very\" and he said \"quite good\""
Code
str_extract(quoted, '".*?"')            # lazy:  first to next "
[1] "\"very\""

Anchors and Word Boundaries

Code
lines <- c("Grammar is structural.", "The grammar of English.", "grammar matters.")

str_subset(lines, "^[A-Z]")            # starts with capital letter
[1] "Grammar is structural."  "The grammar of English."
Code
str_subset(lines, "\\.$")              # ends with full stop
[1] "Grammar is structural."  "The grammar of English."
[3] "grammar matters."       
Code
# Word boundaries prevent partial matches
str_count(exampletext, "the")          # matches "the", "other", "there"...
[1] 6
Code
str_count(exampletext, "\\bthe\\b")    # only the exact word "the"
[1] 5

Alternation and Groups

Code
# Alternation: | inside ()
str_subset(
  c("very nice", "quite good", "so interesting", "fairly common"),
  "\\b(very|quite|so|fairly)\\b"
)
[1] "very nice"      "quite good"     "so interesting" "fairly common" 
Code
# Grouping for quantifiers
str_detect(c("haha", "hahaha", "ha", "hahahahaha"), "(ha){2,}")
[1]  TRUE  TRUE FALSE  TRUE
Code
# Back-references: \\1 matches what group 1 captured
redupl <- c("so so tired", "very very slowly", "quite good")
str_detect(redupl, "\\b(\\w+) \\1\\b")   # reduplicated word
[1]  TRUE  TRUE FALSE
Code
str_match(redupl, "\\b(\\w+) \\1\\b")[, 2]  # extract the word
[1] "so"   "very" NA    
Code
# Match colour/color — without \\b anchors, "colouring" also matches
str_detect(c("colour", "color", "colouring"), "colou?r")
[1] TRUE TRUE TRUE

Named Capture Groups

Named capture groups, written (?<name>...), make complex extraction readable and robust. The group’s value can be accessed by name from the result matrix, which is safer than relying on column position.

Code
# Extract structured information from POS-tagged text
# Format: WORD/POS/LEMMA
tagged <- c("The/DT/the", "corpus/NN/corpus", "contains/VBZ/contain",
            "very/RB/very", "interesting/JJ/interesting", "data/NN/datum")

pattern <- "(?<word>[^/]+)/(?<pos>[^/]+)/(?<lemma>[^/]+)"
m <- str_match(tagged, pattern)

anno_df <- data.frame(
  word  = m[, "word"],
  pos   = m[, "pos"],
  lemma = m[, "lemma"],
  stringsAsFactors = FALSE
)
anno_df
         word pos       lemma
1         The  DT         the
2      corpus  NN      corpus
3    contains VBZ     contain
4        very  RB        very
5 interesting  JJ interesting
6        data  NN       datum
Code
# Extract IPA transcriptions from formatted dictionary entries
dict <- c(
  "linguistics /lɪŋˈɡwɪstɪks/ noun",
  "phonology /fəˈnɒlədʒi/ noun",
  "morphology /mɔːˈfɒlədʒi/ noun",
  "syntax /ˈsɪntæks/ noun"
)
ipa_pattern <- "(?<word>\\w+) /(?<ipa>[^/]+)/ (?<pos>\\w+)"
ipa_m       <- str_match(dict, ipa_pattern)

data.frame(
  word = ipa_m[, "word"],
  ipa  = ipa_m[, "ipa"],
  pos  = ipa_m[, "pos"],
  stringsAsFactors = FALSE
)
         word          ipa  pos
1 linguistics lɪŋˈɡwɪstɪks noun
2   phonology   fəˈnɒlədʒi noun
3  morphology  mɔːˈfɒlədʒi noun
4      syntax     ˈsɪntæks noun
Code
# Named groups with str_match_all for multiple matches per string
# Extract all citation references: Author (Year) format
text_with_cites <- paste(
  "As Chomsky (1957) argued, and later confirmed by Labov (1972),",
  "sociolinguistic variation (Trudgill 1974; Milroy 1980) is systematic."
)

cite_pattern <- "(?<author>[A-Z][a-z]+)\\s+\\((?<year>\\d{4})\\)"
cite_matches <- str_match_all(text_with_cites, cite_pattern)[[1]]

data.frame(
  author = cite_matches[, "author"],
  year   = as.integer(cite_matches[, "year"]),
  stringsAsFactors = FALSE
)
   author year
1 Chomsky 1957
2   Labov 1972

Lookahead and Lookbehind

Lookaround assertions match a position relative to a pattern without including the pattern itself in the match result.

Lookaround syntax
Assertion Syntax Meaning
Positive lookahead (?=...) Position followed by …
Negative lookahead (?!...) Position NOT followed by …
Positive lookbehind (?<=...) Position preceded by …
Negative lookbehind (?<!...) Position NOT preceded by …
Code
# Words immediately preceding "grammar"
str_extract_all(exampletext, "\\w+(?=\\s+grammar)")[[1]]
[1] "of"
Code
# Words immediately following "the"
str_extract_all(exampletext, "(?<=\\bthe\\s)\\w+")[[1]]
[1] "production"   "organisation" "formation"    "formation"    "principles"  
Code
# Amplified adjectives: adjectives following "very" or "quite"
amp_sent <- "The very beautiful garden and the quite interesting lecture."
str_extract_all(amp_sent, "(?<=very |quite )\\w+")[[1]]
[1] "beautiful"   "interesting"
Code
# Split on sentence boundaries WITHOUT consuming the punctuation
# (?<=[.!?]) = preceded by sentence-final punctuation
sentences_split <- str_split(exampletext, "(?<=[.!?])\\s+")[[1]]
sentences_split
[1] "Grammar is a system of rules which governs the production and use of utterances in a given language."                                                                                                                                                                                                   
[2] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
[3] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."                                                                                                                                                                         

Practical Regex for Linguistic Data

Code
# 1. Extract all -ing forms
str_extract_all(exampletext, "\\b\\w+ing\\b")[[1]]
[1] "meaning"    "pertaining"
Code
# 2. Remove XML/HTML tags (common in corpus data)
tagged_text <- "<p>The <hi rend=\"italic\">corpus</hi> contains <b>data</b>.</p>"
str_remove_all(tagged_text, "<[^>]+>")
[1] "The corpus contains data."
Code
# 3. Extract quoted speech
narrative <- 'She said "I will return" and he replied "Good luck".'
str_extract_all(narrative, '"([^"]+)"')[[1]]
[1] "\"I will return\"" "\"Good luck\""    
Code
# 4. Extract year references from academic text
academic <- "Chomsky (1957), Labov (1972), and Trudgill (1974) all contributed."
str_extract_all(academic, "\\d{4}")[[1]]
[1] "1957" "1972" "1974"
Code
# 5. Detect passive constructions (rough heuristic)
passive_pat <- "\\b(is|are|was|were|been)\\s+\\w+ed\\b"
str_detect(splitexampletext, passive_pat)
[1] FALSE FALSE  TRUE
Code
# 6. Anonymise emails
emails_text <- "Contact martin@ladal.edu.au or admin@university.org for details."
str_replace_all(emails_text,
                "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
                "[EMAIL REDACTED]")
[1] "Contact [EMAIL REDACTED] or [EMAIL REDACTED] for details."
Your turn!

Q6 Which regex correctly matches whole words ending in -tion or -sion (e.g. intention, tension)?





Q7 You want to extract the word immediately after “very” in a text, without including “very” in the result. Which regex feature achieves this?






Text Cleaning Pipelines

Section Overview

What you will learn: How to combine multiple string operations into a single reusable cleaning function; common preprocessing steps for corpus linguistics; a tm-based pipeline and a stringr-based alternative; and how to apply either to a full directory of texts

Why Build a Pipeline?

Text cleaning for corpus analysis chains many steps — lowercasing, removing markup, stripping punctuation, removing numbers, eliminating stopwords, collapsing whitespace — and you need to apply the exact same sequence to every text. Encoding the pipeline as a function ensures reproducibility, transparency, and reusability.

When NOT to Remove Stopwords

Stopword removal is appropriate for topic modelling and keyword extraction. But it is inappropriate for grammatical analysis (function words are the data), discourse analysis (markers like well, so, I mean are usually stopwords but often exactly what you want), and sentiment analysis (negation words like not, never are on stopword lists but reverse polarity). Always check whether the words you remove are relevant to your research question.
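To make the fix concrete, here is a minimal base-R sketch (using a small toy stopword list, not tm's full one) that removes stopwords while retaining negation words:

```r
stops     <- c("the", "a", "of", "is", "were", "not", "no", "never")  # toy list
negation  <- c("not", "no", "never")
stops_use <- setdiff(stops, negation)          # stopwords minus negation words

pat <- paste0("\\b(", paste(stops_use, collapse = "|"), ")\\b")
txt <- "the results were not interesting"
cleaned <- trimws(gsub("\\s+", " ", gsub(pat, "", txt)))
cleaned                                        # negation survives:
# [1] "results not interesting"
```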

The tm Building Blocks

Code
raw <- paste(
  "The study of Grammar (including <b>Syntax</b>, Morphology, and Phonology) is central",
  "to Linguistics. There are 3 main branches — explored by linguists since the 19th century."
)

tm::removeNumbers(raw)
[1] "The study of Grammar (including <b>Syntax</b>, Morphology, and Phonology) is central to Linguistics. There are  main branches — explored by linguists since the th century."
Code
tm::removePunctuation(raw)
[1] "The study of Grammar including bSyntaxb Morphology and Phonology is central to Linguistics There are 3 main branches — explored by linguists since the 19th century"
Code
tm::removeWords(raw, tm::stopwords("english"))
[1] "The study  Grammar (including <b>Syntax</b>, Morphology,  Phonology)  central  Linguistics. There  3 main branches — explored  linguists since  19th century."
Code
tm::stripWhitespace(raw)
[1] "The study of Grammar (including <b>Syntax</b>, Morphology, and Phonology) is central to Linguistics. There are 3 main branches — explored by linguists since the 19th century."
Code
tm::stemDocument(raw, language = "en")
[1] "The studi of Grammar (includ <b>Syntax</b>, Morphology, and Phonology) is central to Linguistics. There are 3 main branch — explor by linguist sinc the 19th century."

A Reusable tm-Based Pipeline

Code
clean_text_tm <- function(text,
                           lowercase     = TRUE,
                           rm_markup     = TRUE,
                           rm_punct      = TRUE,
                           rm_numbers    = TRUE,
                           rm_stopwords  = TRUE,
                           stopword_lang = "english",
                           stem          = FALSE,
                           squish_ws     = TRUE) {
  out <- text
  if (rm_markup)    out <- stringr::str_remove_all(out, "<[^>]+>")
  if (lowercase)    out <- tolower(out)
  if (rm_punct)     out <- tm::removePunctuation(out)
  if (rm_numbers)   out <- tm::removeNumbers(out)
  if (rm_stopwords) out <- tm::removeWords(out, tm::stopwords(stopword_lang))
  if (stem)         out <- tm::stemDocument(out, language = stopword_lang)
  if (squish_ws)    out <- tm::stripWhitespace(out)
  stringr::str_trim(out)
}

clean_text_tm(raw)
[1] "study grammar including syntax morphology phonology central linguistics main branches — explored linguists since th century"
Code
clean_text_tm(raw, rm_stopwords = FALSE) |> substr(1, 80)
[1] "the study of grammar including syntax morphology and phonology is central to lin"
Code
clean_text_tm(raw, stem = TRUE) |> substr(1, 80)
[1] "studi grammar includ syntax morpholog phonolog central linguist main branch — ex"

A stringr-Based Pipeline

The stringr alternative gives more control over punctuation rules and handles Unicode better:

Code
clean_text_stringr <- function(text,
                                lowercase     = TRUE,
                                rm_markup     = TRUE,
                                rm_punct      = TRUE,
                                rm_numbers    = TRUE,
                                rm_stopwords  = TRUE,
                                keep_hyphens  = TRUE,
                                squish_ws     = TRUE) {
  out <- text

  # 1. Remove XML/HTML markup
  if (rm_markup)  out <- str_remove_all(out, "<[^>]+>")

  # 2. Lowercase
  if (lowercase)  out <- str_to_lower(out)

  # 3. Remove punctuation (optionally keep internal hyphens)
  if (rm_punct) {
    if (keep_hyphens) {
      out <- str_remove_all(out, "[^\\w\\s\\-]")   # keep - inside words
    } else {
      out <- str_remove_all(out, "[^\\w\\s]")
    }
  }

  # 4. Remove numbers
  if (rm_numbers) out <- str_remove_all(out, "\\d+")

  # 5. Remove stopwords with word-boundary matching
  if (rm_stopwords) {
    stops   <- tm::stopwords("english")
    pattern <- str_c("\\b(", str_c(stops, collapse = "|"), ")\\b")
    out     <- str_remove_all(out, pattern)
  }

  # 6. Collapse whitespace
  if (squish_ws) out <- str_squish(out)

  out
}

clean_text_stringr(raw)
[1] "study grammar including syntax morphology phonology central linguistics main branches explored linguists since th century"
Code
# Demonstrate keep_hyphens option
hyphen_text <- "Well-known socio-linguistic phenomena include code-switching."
clean_text_stringr(hyphen_text, rm_stopwords = FALSE, keep_hyphens = TRUE)
[1] "well-known socio-linguistic phenomena include code-switching"
Code
clean_text_stringr(hyphen_text, rm_stopwords = FALSE, keep_hyphens = FALSE)
[1] "wellknown sociolinguistic phenomena include codeswitching"

Applying a Pipeline to a Corpus

Code
# Simulate a small corpus (in practice: read from files)
corpus_raw <- c(
  T01 = "The <b>grammar</b> of English has changed since the 1800s.",
  T02 = "Syntax deals with sentence structure — 3 main frameworks exist.",
  T03 = "Morphology examines word formation and the structure of words.",
  T04 = "Phonology studies the sound systems of languages (44 phonemes in English)."
)

# Apply pipeline to all texts
corpus_clean <- purrr::map_chr(corpus_raw, clean_text_stringr)

# Display before/after
data.frame(
  id     = names(corpus_raw),
  before = str_trunc(corpus_raw,   60),
  after  = str_trunc(corpus_clean, 60)
) |>
  flextable() |>
  flextable::set_table_properties(width = 1, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 10) |>
  flextable::set_caption("Corpus texts before and after cleaning pipeline")

Corpus texts before and after cleaning pipeline
id before after
T01 The <b>grammar</b> of English has changed since the 1800s. grammar english changed since s
T02 Syntax deals with sentence structure — 3 main frameworks ... syntax deals sentence structure main frameworks exist
T03 Morphology examines word formation and the structure of w... morphology examines word formation structure words
T04 Phonology studies the sound systems of languages (44 phon... phonology studies sound systems languages phonemes english

A researcher applies the pipeline lowercase → removePunctuation → removeStopwords → stripWhitespace to her corpus. She later finds that “not interesting” has become just “interesting” throughout, reversing the intended meaning of many sentences. Which step caused this and how should she fix it?

  1. lowercase — preserving capitalisation would have prevented this
  2. removePunctuation — punctuation carries semantic information
  3. removeStopwords — “not” is on the English stopword list; she should use a custom stopword list that excludes negation words, or skip stopword removal entirely for this analysis
  4. stripWhitespace — collapsing spaces altered the word sequence
Answer

c) removeStopwords

English stopword lists include negation words like not, never, no, nor, neither. Removing them from text that will be analysed for meaning or sentiment is a serious error because these words reverse the polarity of surrounding words. The fix: create a custom stopword list that excludes all negation words, or skip stopword removal and rely on your analysis method to handle function words appropriately.


Tokenisation with quanteda

Section Overview

What you will learn: What tokenisation is; the difference between word, sentence, and character tokenisation; how to use quanteda’s tokens() function with various options; and how to inspect, filter, and work with the resulting token objects

What Is Tokenisation?

Tokenisation is the process of splitting a text into a sequence of discrete units called tokens. A token is typically a word, but it can also be a sentence, character, n-gram, or any other unit depending on your analytical goal.

Tokenisation options in quanteda
Unit Function Returns Typical use
Sentence quanteda::tokenize_sentence() List of sentence strings Sentence-level analysis, KWIC
Word quanteda::tokens(what = "word") tokens object Frequency analysis, collocations
Character quanteda::tokens(what = "character") tokens object Character n-grams, orthographic analysis
N-gram quanteda::tokens_ngrams() tokens object Collocation, language models
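Before turning to quanteda, the effect of the unit choice can be illustrated with base R alone (a rough whitespace heuristic, not quanteda's tokeniser):

```r
txt <- "Tokenisation splits text into units."

words <- strsplit(txt, "\\s+")[[1]]               # word tokens (whitespace split)
chars <- strsplit(gsub(" ", "", txt), "")[[1]]    # character tokens (spaces dropped)

length(words)   # 5
length(chars)   # 32
```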

Sentence Tokenisation

Code
# Split text into sentences
et_sentences <- quanteda::tokenize_sentence(exampletext) |> unlist()
et_sentences
[1] "Grammar is a system of rules which governs the production and use of utterances in a given language."                                                                                                                                                                                                   
[2] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
[3] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."                                                                                                                                                                         
Code
# Works on a vector of texts too
multi_sent <- quanteda::tokenize_sentence(
  c(exampletext, additionaltext)
)
lengths(multi_sent)   # how many sentences per text?
[1] 3 4

Word Tokenisation

Code
# Build a quanteda corpus first
corp <- quanteda::corpus(
  c(exampletext, additionaltext),
  docnames = c("grammar", "saussure")
)

# Default word tokenisation (preserves punctuation)
toks_default <- quanteda::tokens(corp, what = "word")
head(as.character(toks_default[[1]]), 20)
 [1] "Grammar"    "is"         "a"          "system"     "of"        
 [6] "rules"      "which"      "governs"    "the"        "production"
[11] "and"        "use"        "of"         "utterances" "in"        
[16] "a"          "given"      "language"   "."          "These"     
Code
# Clean tokenisation: remove punctuation, symbols, numbers, URLs
toks_clean <- quanteda::tokens(
  corp,
  what           = "word",
  remove_punct   = TRUE,
  remove_symbols = TRUE,
  remove_numbers = FALSE,
  remove_url     = TRUE,
  split_hyphens  = FALSE   # keep "well-known" as one token
)
head(as.character(toks_clean[[1]]), 20)
 [1] "Grammar"    "is"         "a"          "system"     "of"        
 [6] "rules"      "which"      "governs"    "the"        "production"
[11] "and"        "use"        "of"         "utterances" "in"        
[16] "a"          "given"      "language"   "These"      "rules"     
Code
# Token counts
lengths(toks_clean)
 grammar saussure 
      81      111 

Removing Stopwords in quanteda

Code
# quanteda has built-in stopword lists
head(quanteda::stopwords("en"), 20)
 [1] "i"          "me"         "my"         "myself"     "we"        
 [6] "our"        "ours"       "ourselves"  "you"        "your"      
[11] "yours"      "yourself"   "yourselves" "he"         "him"       
[16] "his"        "himself"    "she"        "her"        "hers"      
Code
# Remove stopwords from tokens object
toks_nostop <- quanteda::tokens_remove(
  toks_clean,
  pattern = quanteda::stopwords("en"),
  padding = FALSE   # TRUE replaces removed tokens with "" (preserves positions)
)

head(as.character(toks_nostop[[1]]), 20)
 [1] "Grammar"      "system"       "rules"        "governs"      "production"  
 [6] "use"          "utterances"   "given"        "language"     "rules"       
[11] "apply"        "sound"        "well"         "meaning"      "include"     
[16] "componential" "subsets"      "rules"        "pertaining"   "phonology"   
Code
# Compare token counts before/after stopword removal
data.frame(
  text     = names(toks_clean),
  with_sw  = lengths(toks_clean),
  without_sw = lengths(toks_nostop)
) |>
  dplyr::mutate(pct_removed = round(100 * (1 - without_sw / with_sw), 1))
             text with_sw without_sw pct_removed
grammar   grammar      81         45        44.4
saussure saussure     111         64        42.3

Selecting and Filtering Tokens

Code
# Keep only tokens matching a pattern
toks_nouns <- quanteda::tokens_select(
  toks_clean,
  pattern   = c("grammar", "syntax", "morphology", "phonology",
                "language", "linguistic*"),   # * is a glob wildcard
  valuetype = "glob"
)
as.character(toks_nouns[[1]])
[1] "Grammar"     "language"    "phonology"   "morphology"  "syntax"     
[6] "grammar"     "linguistics"
Code
# tokens_select with regex (unanchored: it also matches "ling" inside "linguistics")
toks_ing <- quanteda::tokens_select(
  toks_clean,
  pattern   = "\\w+ing",
  valuetype = "regex"
)
as.character(toks_ing[[1]])
[1] "meaning"     "pertaining"  "linguistics"

N-Grams

N-grams are consecutive sequences of n tokens. Bigrams (n=2) and trigrams (n=3) are especially useful for collocation analysis and language modelling.

Code
# Extract bigrams
toks_bigrams <- quanteda::tokens_ngrams(toks_nostop, n = 2)
head(as.character(toks_bigrams[[1]]), 15)
 [1] "Grammar_system"       "system_rules"         "rules_governs"       
 [4] "governs_production"   "production_use"       "use_utterances"      
 [7] "utterances_given"     "given_language"       "language_rules"      
[10] "rules_apply"          "apply_sound"          "sound_well"          
[13] "well_meaning"         "meaning_include"      "include_componential"
Code
# Skipgrams: pairs with up to k tokens skipped between them
toks_skip2 <- quanteda::tokens_ngrams(toks_nostop, n = 2, skip = 0:2)
head(as.character(toks_skip2[[1]]), 15)
 [1] "Grammar_system"        "Grammar_rules"         "Grammar_governs"      
 [4] "system_rules"          "system_governs"        "system_production"    
 [7] "rules_governs"         "rules_production"      "rules_use"            
[10] "governs_production"    "governs_use"           "governs_utterances"   
[13] "production_use"        "production_utterances" "production_given"     
Code
# Convert to a document-feature matrix for analysis
dfm_bigrams <- quanteda::dfm(toks_bigrams)
# Top features by frequency
quanteda::topfeatures(dfm_bigrams, n = 10)
         system_rules formation_composition    chomsky_competence 
                    2                     2                     2 
       grammar_system         rules_governs    governs_production 
                    1                     1                     1 
       production_use        use_utterances      utterances_given 
                    1                     1                     1 
       given_language 
                    1 

Document-Feature Matrix (DFM)

The document-feature matrix (DFM) represents a corpus as a matrix where rows are documents and columns are features (tokens). It is the standard input for most corpus-statistical analyses.
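A minimal sketch of that structure, built by hand with base R (toy documents, not the tutorial corpus):

```r
docs  <- list(d1 = c("rules", "grammar", "rules"),
              d2 = c("grammar", "syntax"))
feats <- sort(unique(unlist(docs)))          # the feature vocabulary

# rows = documents, columns = features, cells = counts per document
m <- t(sapply(docs, function(toks) table(factor(toks, levels = feats))))
m
#    grammar rules syntax
# d1       1     2      0
# d2       1     0      1
```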

Code
# Build DFM from clean tokens
dfm_clean <- quanteda::dfm(toks_clean)
dfm_clean
Document-feature matrix of: 2 documents, 111 features (42.34% sparse) and 0 docvars.
          features
docs       grammar is a system of rules which governs the production
  grammar        2  1 2      1  8     3     1       1   5          1
  saussure       1  4 1      1  5     1     1       0   6          0
[ reached max_nfeat ... 101 more features ]
Code
# Dimensions: documents × features
dim(dfm_clean)
[1]   2 111
Code
# Top features across the corpus
quanteda::topfeatures(dfm_clean, n = 15)
      of      the      and       in       is       to    rules  grammar 
      13       11       11        7        5        5        4        3 
       a language       as     that   langue   parole   system 
       3        3        3        3        3        3        2 
Code
# Weight by TF-IDF (downweights features common across all documents)
dfm_tfidf <- quanteda::dfm_tfidf(dfm_clean)
quanteda::topfeatures(dfm_tfidf, n = 10)
         as      langue      parole       sound   formation composition 
     0.9031      0.9031      0.9031      0.6021      0.6021      0.6021 
    between         his   according    specific 
     0.6021      0.6021      0.6021      0.6021 
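The weights above can be reproduced by hand. quanteda's default scheme multiplies the raw count by log10(number of documents / document frequency), so a feature occurring in every document gets weight 0. A sketch with illustrative counts consistent with the outputs above ("as": 3 occurrences, in one of the two documents; "of": present in both documents):

```r
n  <- 2                          # documents in the corpus
tf <- c(as = 3, of = 13)         # raw counts (illustrative)
df <- c(as = 1, of = 2)          # documents containing each feature

tfidf <- tf * log10(n / df)
round(tfidf, 4)
#     as     of
# 0.9031 0.0000
```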
Code
# Simple frequency plot
top15 <- quanteda::topfeatures(dfm_clean, n = 15)
data.frame(word = names(top15), freq = top15) |>
  ggplot(aes(x = reorder(word, freq), y = freq)) +
  geom_col(fill = "steelblue", color = "white") +
  coord_flip() +
  theme_bw() +
  labs(title = "Top 15 tokens in example corpus",
       x = "Token", y = "Frequency")

Your turn!

Q9 You tokenise a text with quanteda::tokens(corp, remove_punct = TRUE) and then run tokens_remove(toks, stopwords("en"), padding = TRUE). What does padding = TRUE do?





Q10 What is a document-feature matrix (DFM), and which of the following correctly describes its structure?





Challenge!

Q11 How many word tokens does linguistics04.txt contain?

Show solution
readLines(here::here("data/testcorpus/linguistics04.txt")) |>
  paste(collapse = " ") |>
  str_split("\\s+") |>
  unlist() |>
  length()

Q12 How many individual characters does linguistics04.txt contain?

Show solution
readLines(here::here("data/testcorpus/linguistics04.txt")) |>
  paste(collapse = " ") |>
  strsplit("") |>
  unlist() |>
  length()

Citation and Session Info

Schweinberger, Martin. 2026. String Processing in R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/string/string.html (Version 2026.02.24).

@manual{schweinberger2026string,
  author       = {Schweinberger, Martin},
  title        = {String Processing in R},
  note         = {https://ladal.edu.au/tutorials/string/string.html},
  year         = {2026},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL)},
  address      = {Brisbane},
  edition      = {2026.02.24}
}
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] flextable_0.9.11 here_1.0.2       checkdown_0.0.13 udpipe_0.8.11   
 [5] tm_0.7-16        NLP_0.3-2        quanteda_4.2.0   lubridate_1.9.4 
 [9] forcats_1.0.0    stringr_1.5.1    dplyr_1.2.0      purrr_1.0.4     
[13] readr_2.1.5      tidyr_1.3.2      tibble_3.2.1     ggplot2_4.0.2   
[17] tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] fastmatch_1.1-6         gtable_0.3.6            xfun_0.56              
 [4] htmlwidgets_1.6.4       lattice_0.22-6          tzdb_0.4.0             
 [7] vctrs_0.7.1             tools_4.4.2             generics_0.1.3         
[10] parallel_4.4.2          klippy_0.0.0.9500       pkgconfig_2.0.3        
[13] Matrix_1.7-2            data.table_1.17.0       RColorBrewer_1.1-3     
[16] S7_0.2.1                assertthat_0.2.1        uuid_1.2-1             
[19] lifecycle_1.0.5         compiler_4.4.2          farver_2.1.2           
[22] textshaping_1.0.0       codetools_0.2-20        litedown_0.9           
[25] fontquiver_0.2.1        fontLiberation_0.1.0    SnowballC_0.7.1        
[28] htmltools_0.5.9         yaml_2.3.10             crayon_1.5.3           
[31] pillar_1.10.1           openssl_2.3.2           fontBitstreamVera_0.1.1
[34] commonmark_2.0.0        stopwords_2.3           zip_2.3.2              
[37] tidyselect_1.2.1        digest_0.6.39           stringi_1.8.4          
[40] slam_0.1-55             labeling_0.4.3          rprojroot_2.1.1        
[43] fastmap_1.2.0           grid_4.4.2              cli_3.6.4              
[46] magrittr_2.0.3          patchwork_1.3.0         withr_3.0.2            
[49] gdtools_0.5.0           scales_1.4.0            timechange_0.3.0       
[52] rmarkdown_2.30          officer_0.7.3           ragg_1.3.3             
[55] askpass_1.2.1           hms_1.1.3               evaluate_1.0.3         
[58] knitr_1.51              markdown_2.0            rlang_1.1.7            
[61] Rcpp_1.1.1              glue_1.8.0              xml2_1.3.6             
[64] renv_1.1.7              rstudioapi_0.17.1       jsonlite_1.9.0         
[67] R6_2.6.1                systemfonts_1.3.1      
AI Transparency Statement

This tutorial was written with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to substantially expand and restructure a shorter existing LADAL tutorial, adding the base R reference section, the full stringr coverage, str_glue and str_glue_data interpolation examples, the forcats section, string padding and formatting for table output, the encoding and Unicode section, the regular expressions section (including named capture groups and lookaround assertions), the text-cleaning pipelines section, and the expanded quanteda tokenisation section. All content was reviewed and approved by Martin Schweinberger, who takes full responsibility for its accuracy.



